{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 1000
    },
    "colab_type": "code",
    "id": "kUlQMiPZxfS_",
    "outputId": "cee17d53-c44c-4821-ebeb-4fa347c316b2"
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
    "\n",
    "Instructions for setting up Colab are as follows:\n",
    "1. Open a new Python 3 notebook.\n",
    "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
    "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
    "4. Run this cell to set up dependencies.\n",
    "\"\"\"\n",
    "# If you're using Google Colab and not running locally, run this cell.\n",
    "!pip install wget\n",
    "!apt-get install sox libsndfile1 ffmpeg\n",
    "!git clone https://github.com/NVIDIA/NeMo.git\n",
    "NEMO_ROOT = !pwd\n",
    "NEMO_ROOT = NEMO_ROOT[0]\n",
    "!cd NeMo && pip install \".[all]\"\n",
    "!pip install unidecode"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "VgTR8CMlxu3p"
   },
   "source": [
    "# **SPEAKER RECOGNITION** \n",
    "\n",
    "Speaker Recognition (SR) is a broad research area which solves two major tasks: speaker identification (who is speaking?) and\n",
    "speaker verification (is the speaker who they claim to be?). In this work, we focus on far-field,\n",
    "text-independent speaker recognition when the identity of the speaker is based on how the speech is spoken,\n",
    "not necessarily in what is being said. Typically such SR systems operate on unconstrained speech utterances,\n",
    "which are converted into vectors of fixed length, called speaker embeddings. Speaker embeddings are also used in\n",
    "automatic speech recognition (ASR) and speech synthesis.\n",
    "\n",
    "As the goal of most speaker related systems is to get good speaker level embeddings that could help distinguish from\n",
    "other speakers, we shall first train these embeddings in end-to-end\n",
    "manner optimizing the [QuatzNet](https://arxiv.org/abs/1910.10261) based encoder model on cross-entropy loss.\n",
    "We modify the decoder to get these fixed size embeddings irrespective of the length of ithe nput audio. We employ a mean and variance\n",
    "based statistics pooling method to grab these embeddings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "KzzOC5rpx9y6"
   },
   "source": [
    "In this tutorial, we shall first train these embeddings on speaker related datasets and then get speaker embeddings from a pretrained network for a new dataset. Since Google Colab has very slow read-write speeds, Please run this locally for training on [hi-mia](https://arxiv.org/abs/1912.01231). \n",
    "\n",
    "We use the [get_hi-mia-data.py](https://github.com/NVIDIA/NeMo/blob/master/scripts/get_hi-mia_data.py) script to download the necessary files, extract them, and also re-sample to 16Khz if any of these samples are not at 16Khz. We provide scripts to score these embeddings for a speaker-verification task like hi-mia dataset at the end. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 119
    },
    "colab_type": "code",
    "id": "UO_hAhMx0rwv",
    "outputId": "493bd23a-d07a-46db-e634-d38a09f70ef3"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "try: NEMO_ROOT = os.path.join(NEMO_ROOT, 'NeMo')\n",
    "except NameError: NEMO_ROOT = os.path.abspath(\"../../../\")\n",
    "print(NEMO_ROOT)\n",
    "print(os.getcwd())\n",
    "\n",
    "data_dir = 'data'\n",
    "os.makedirs(data_dir, exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download and process dataset. This will take a few moments...\n",
    "### WARNING: This dataset is over 45GB. It will take a long time to download and process. We do not reccomend doing\n",
    "### on colab.\n",
    "!python {NEMO_ROOT}/scripts/get_hi-mia_data.py --data_root $data_dir"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the download and conversion, your `data` folder should contain directories with manifest files as:\n",
    "\n",
    "* `data/<set>/train.json`\n",
    "* `data/<set>/dev.json` \n",
    "* `data/<set>/{set}_all.json` \n",
    "\n",
    "For each set, we create utt2spk files, these files would be used later in PLDA training.\n",
    "\n",
    "Each line in the manifest file describes a training sample - `audio_filepath` contains the path to the wav file, `duration` it's duration in seconds, and `label` is the speaker class label:\n",
    "\n",
    "`{\"audio_filepath\": \"<absolute path to dataset>/data/train/SPEECHDATA/wav/SV0184/SV0184_6_04_N3430.wav\", \"duration\": 1.22, \"label\": \"SV0184\"}` \n",
    "\n",
    "`{\"audio_filepath\": \"<absolute path to dataset>/data/train/SPEECHDATA/wav/SV0184/SV0184_5_03_F2037.wav\", duration\": 1.375, \"label\": \"SV0184\"}`\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "F4rBMntjpPph"
   },
   "source": [
    "Import necessary packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 187
    },
    "colab_type": "code",
    "id": "4mSWNvdZPIwR",
    "outputId": "83455882-4924-4d18-afd3-d2c8ee8ed78d"
   },
   "outputs": [],
   "source": [
    "from ruamel.yaml import YAML\n",
    "\n",
    "import nemo\n",
    "import nemo.collections.asr as nemo_asr\n",
    "import copy\n",
    "from functools import partial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "CeKfJQ-YpTOv"
   },
   "source": [
    "# Building Training and Evaluation DAGs with NeMo\n",
    "Building a model using NeMo consists of \n",
    "\n",
    "1.  Instantiating the neural modules we need\n",
    "2.  specifying the DAG by linking them together.\n",
    "\n",
    "In NeMo, the training and inference pipelines are managed by a NeuralModuleFactory, which takes care of checkpointing, callbacks, and logs, along with other details in training and inference. We set its log_dir argument to specify where our model logs and outputs will be written, and can set other training and inference settings in its constructor. For instance, if we were resuming training from a checkpoint, we would set the argument `checkpoint_dir=<path_to_checkpoint>`.\n",
    "\n",
    "Along with logs in NeMo, you can optionally view the tensorboard logs with the create_tb_writer=True argument to the NeuralModuleFactory. By default all the tensorboard log files will be stored in {log_dir}/tensorboard, but you can change this with the tensorboard_dir argument. One can load tensorboard logs through tensorboard by running `tensorboard --logdir=<path_to_tensorboard dir>` in the terminal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "uyn2xrR7R1K_"
   },
   "outputs": [],
   "source": [
    "exp_name = 'quartznet3x2_hi-mia'\n",
    "work_dir = './myExps/'\n",
    "neural_factory = nemo.core.NeuralModuleFactory(\n",
    "    log_dir=work_dir+\"/hi-mia_logdir/\",\n",
    "    checkpoint_dir=\"./myExps/checkpoints/\" + exp_name,\n",
    "    create_tb_writer=True,\n",
    "    random_seed=42,\n",
    "    tensorboard_dir=work_dir+'/tensorboard/',\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "k-juqc40p8KN"
   },
   "source": [
    "Now that we have our neural module factory, we can specify our **neural modules and instantiate them**. Here, we load the parameters for each module from the configuration file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "id": "mC-KPOy-rpLA",
    "outputId": "1d902505-6e35-4eb8-aebf-8401c1bdd39c"
   },
   "outputs": [],
   "source": [
    "from nemo.utils import logging\n",
    "yaml = YAML(typ=\"safe\")\n",
    "with open(os.path.join(NEMO_ROOT,'examples/speaker_recognition/configs/quartznet_spkr_3x2x512_xvector.yaml')) as f:\n",
    "    spkr_params = yaml.load(f)\n",
    "\n",
    "sample_rate = spkr_params[\"sample_rate\"]\n",
    "time_length = spkr_params.get(\"time_length\", 8)\n",
    "logging.info(\"max time length considered for each file is {} sec\".format(time_length))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "5VgzNS1lrrqS"
   },
   "source": [
    "Instantiate the train data_layer using config arguments. `labels = None` automatically creates output labels from the manifest files, if you would like to pass those speaker names you can use the labels option. So while instantiating eval data_layer, we can pass labels to the class in order to match same the speaker output labels as we have in the training data layer. This comes in handy while training on multiple datasets with more than one manifest file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 153
    },
    "colab_type": "code",
    "id": "dC9QOenNPoUs",
    "outputId": "786aac99-57f6-4066-e9a3-4908dc6e3d7a"
   },
   "outputs": [],
   "source": [
    "train_dl_params = copy.deepcopy(spkr_params[\"AudioToSpeechLabelDataLayer\"])\n",
    "train_dl_params.update(spkr_params[\"AudioToSpeechLabelDataLayer\"][\"train\"])\n",
    "del train_dl_params[\"train\"]\n",
    "del train_dl_params[\"eval\"]\n",
    "\n",
    "batch_size=64\n",
    "data_layer_train = nemo_asr.AudioToSpeechLabelDataLayer(\n",
    "    manifest_filepath=os.path.join(data_dir, 'train/train.json'),\n",
    "    labels=None,\n",
    "    batch_size=batch_size,\n",
    "    time_length=time_length,\n",
    "    **train_dl_params,\n",
    ")\n",
    "\n",
    "eval_dl_params = copy.deepcopy(spkr_params[\"AudioToSpeechLabelDataLayer\"])\n",
    "eval_dl_params.update(spkr_params[\"AudioToSpeechLabelDataLayer\"][\"eval\"])\n",
    "del eval_dl_params[\"train\"]\n",
    "del eval_dl_params[\"eval\"]\n",
    "\n",
    "data_layer_eval = nemo_asr.AudioToSpeechLabelDataLayer(\n",
    "    manifest_filepath=os.path.join(data_dir, 'train/dev.json'),\n",
    "    labels=data_layer_train.labels,\n",
    "    batch_size=batch_size,\n",
    "    time_length=time_length,\n",
    "    **eval_dl_params,\n",
    ")\n",
    "\n",
    "data_preprocessor = nemo_asr.AudioToMelSpectrogramPreprocessor(\n",
    "    sample_rate=sample_rate, **spkr_params[\"AudioToMelSpectrogramPreprocessor\"],\n",
    ")\n",
    "encoder = nemo_asr.JasperEncoder(**spkr_params[\"JasperEncoder\"],)\n",
    "\n",
    "decoder = nemo_asr.JasperDecoderForSpkrClass(\n",
    "    feat_in=spkr_params[\"JasperEncoder\"][\"jasper\"][-1][\"filters\"],\n",
    "    num_classes=data_layer_train.num_classes,\n",
    "    pool_mode=spkr_params[\"JasperDecoderForSpkrClass\"]['pool_mode'],\n",
    "    emb_sizes=spkr_params[\"JasperDecoderForSpkrClass\"][\"emb_sizes\"].split(\",\"),\n",
    ")\n",
    "\n",
    "xent_loss = nemo_asr.CrossEntropyLossNM(weight=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "9bAP70DqsXGY"
   },
   "source": [
    "The next step is to assemble our training DAG by specifying the inputs to each neural module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 224
    },
    "colab_type": "code",
    "id": "1raBGmd5Vshl",
    "outputId": "33a128f4-193f-4913-9c82-fd27610dfb9a"
   },
   "outputs": [],
   "source": [
    "audio_signal, audio_signal_len, label, label_len = data_layer_train()\n",
    "processed_signal, processed_signal_len = data_preprocessor(input_signal=audio_signal, length=audio_signal_len)\n",
    "encoded, encoded_len = encoder(audio_signal=processed_signal, length=processed_signal_len)\n",
    "logits, _ = decoder(encoder_output=encoded)\n",
    "loss = xent_loss(logits=logits, labels=label)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "uwnZT8ycsYMa"
   },
   "source": [
    "We would like to be able to evaluate our model on the dev set, as well, so let's set up the evaluation DAG.\n",
    "\n",
    "Our evaluation DAG will reuse most of the parts of the training DAG with the exception of the data layer, since we are loading the evaluation data from a different file but evaluating on the same model. Note that if we were using data augmentation in training, we would also leave that out in the evaluation DAG."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 224
    },
    "colab_type": "code",
    "id": "sPPyiNtLWDyf",
    "outputId": "3c37a7dd-b85c-4d29-edfe-cfd0cd16abe4"
   },
   "outputs": [],
   "source": [
    "audio_signal_test, audio_len_test, label_test, _ = data_layer_eval()\n",
    "processed_signal_test, processed_len_test = data_preprocessor(\n",
    "    input_signal=audio_signal_test, length=audio_len_test\n",
    ")\n",
    "encoded_test, encoded_len_test = encoder(audio_signal=processed_signal_test, length=processed_len_test)\n",
    "logits_test, _ = decoder(encoder_output=encoded_test)\n",
    "loss_test = xent_loss(logits=logits_test, labels=label_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "8m7dz1-usp1S"
   },
   "source": [
    "# Creating CallBacks\n",
    "\n",
    "We would like to be able to monitor our model while it's training, so we use callbacks. In general, callbacks are functions that are called at specific intervals over the course of training or inference, such as at the start or end of every n iterations, epochs, etc. The callbacks we'll be using for this are the SimpleLossLoggerCallback, which reports the training loss (or another metric of your choosing, such as \\% accuracy for speaker recognition tasks), and the EvaluatorCallback, which regularly evaluates the model on the dev set. Both of these callbacks require you to pass in the tensors to be evaluated--these would be the final outputs of the training and eval DAGs above.\n",
    "\n",
    "Another useful callback is the CheckpointCallback, for saving checkpoints at set intervals. We create one here just to demonstrate how it works."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "LFlXnbRaWTVl"
   },
   "outputs": [],
   "source": [
    "from nemo.collections.asr.helpers import (\n",
    "    monitor_classification_training_progress,\n",
    "    process_classification_evaluation_batch,\n",
    "    process_classification_evaluation_epoch,\n",
    ")\n",
    "from nemo.utils.lr_policies import CosineAnnealing\n",
    "\n",
    "train_callback = nemo.core.SimpleLossLoggerCallback(\n",
    "    tensors=[loss, logits, label],\n",
    "    print_func=partial(monitor_classification_training_progress, eval_metric=[1]),\n",
    "    step_freq=1000,\n",
    "    get_tb_values=lambda x: [(\"train_loss\", x[0])],\n",
    "    tb_writer=neural_factory.tb_writer,\n",
    ")\n",
    "\n",
    "callbacks = [train_callback]\n",
    "\n",
    "chpt_callback = nemo.core.CheckpointCallback(\n",
    "    folder=\"./myExps/checkpoints/\" + exp_name,\n",
    "    load_from_folder=\"./myExps/checkpoints/\" + exp_name,\n",
    "    step_freq=1000,\n",
    ")\n",
    "callbacks.append(chpt_callback)\n",
    "\n",
    "tagname = \"hi-mia_dev\"\n",
    "eval_callback = nemo.core.EvaluatorCallback(\n",
    "    eval_tensors=[loss_test, logits_test, label_test],\n",
    "    user_iter_callback=partial(process_classification_evaluation_batch, top_k=1),\n",
    "    user_epochs_done_callback=partial(process_classification_evaluation_epoch, tag=tagname),\n",
    "    eval_step=1000,  # How often we evaluate the model on the test set\n",
    "    tb_writer=neural_factory.tb_writer,\n",
    ")\n",
    "\n",
    "callbacks.append(eval_callback)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "a8EFjLsWs_jM"
   },
   "source": [
    "Now that we have our model and callbacks set up, how do we run it?\n",
    "\n",
    "Once we create our neural factory and the callbacks for the information that we want to see, we can start training by simply calling the train function on the tensors we want to optimize and our callbacks! Since this notebook is for you to get started and since the an4 as dataset is small, it would quickly get higher accuracies. For better models use bigger datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 1000
    },
    "colab_type": "code",
    "id": "xHTEtz7yXVMK",
    "outputId": "bd53ae06-cd0d-4291-da66-1af3079cbd86"
   },
   "outputs": [],
   "source": [
    "# train model\n",
    "# As the dataset is a large dataset, we recommend doing the training using our example scripts instead of a notebook\n",
    "# After training the model, you can skip this cell, copy your trained checkpoints to `./myExps/checkpoints/quartznet3x2_hi-mia`\n",
    "# and continue in the next cell\n",
    "num_epochs=25\n",
    "N = len(data_layer_train)\n",
    "steps_per_epoch = N // batch_size\n",
    "\n",
    "logging.info(\"Number of steps per epoch {}\".format(steps_per_epoch))\n",
    "\n",
    "neural_factory.train(\n",
    "        tensors_to_optimize=[loss],\n",
    "        callbacks=callbacks,\n",
    "        lr_policy=CosineAnnealing(\n",
    "            num_epochs * steps_per_epoch, warmup_steps=0.1 * num_epochs * steps_per_epoch,\n",
    "        ),\n",
    "        optimizer=\"novograd\",\n",
    "        optimization_params={\n",
    "            \"num_epochs\": num_epochs,\n",
    "            \"lr\": 0.02,\n",
    "            \"betas\": (0.95, 0.5),\n",
    "            \"weight_decay\": 0.001,\n",
    "            \"grad_norm_clip\": None,\n",
    "        }\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "BB6s19pmxGfX"
   },
   "source": [
    "Now that we trained our embeddings, we shall extract these embeddings using our pretrained checkpoint present at `checkpoint_dir`. As we can see from the neural architecture, we extract the embeddings after the `emb1` layer. \n",
    "![Speaker Recognition Layers](./speaker_reco.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "oSIDu6jkym66"
   },
   "source": [
    "Now use the test manifest to get the embeddings. As we saw before, let's create a new `data_layer` for test. Use previously instiated models and attach the DAGs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 258
    },
    "colab_type": "code",
    "id": "5JqUVbKDY32a",
    "outputId": "dd835e02-8882-4287-9639-c249ac3dfc94"
   },
   "outputs": [],
   "source": [
    "eval_dl_params = copy.deepcopy(spkr_params[\"AudioToSpeechLabelDataLayer\"])\n",
    "eval_dl_params.update(spkr_params[\"AudioToSpeechLabelDataLayer\"][\"eval\"])\n",
    "del eval_dl_params[\"train\"]\n",
    "del eval_dl_params[\"eval\"]\n",
    "eval_dl_params['shuffle'] = False  # To grab  the file names without changing data_layer\n",
    "\n",
    "test_dataset = os.path.join(data_dir, 'test/test_all.json')\n",
    "data_layer_test = nemo_asr.AudioToSpeechLabelDataLayer(\n",
    "    manifest_filepath=test_dataset,\n",
    "    labels=None,\n",
    "    batch_size=batch_size,\n",
    "    **eval_dl_params,\n",
    ")\n",
    "\n",
    "audio_signal_test, audio_len_test, label_test, _ = data_layer_test()\n",
    "processed_signal_test, processed_len_test = data_preprocessor(input_signal=audio_signal_test, length=audio_len_test)\n",
    "encoded_test, _ = encoder(audio_signal=processed_signal_test, length=processed_len_test)\n",
    "_, embeddings = decoder(encoder_output=encoded_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "dwEifkD9zfpl"
   },
   "source": [
    "Now get the embeddings using `neural_factory.infer` command. It does a forward pass of all our modules and save our embeddings in `<work_dir>/embeddings`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 153
    },
    "colab_type": "code",
    "id": "wGxYiFpJze5h",
    "outputId": "dbbc7204-28bc-43e9-b6aa-f3f757f5d4b5"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import json\n",
    "eval_tensors = neural_factory.infer(tensors=[embeddings, label_test], checkpoint_dir=\"./myExps/checkpoints/\" + exp_name)\n",
    "\n",
    "inf_emb, inf_label = eval_tensors\n",
    "whole_embs = []\n",
    "whole_labels = []\n",
    "manifest = open(test_dataset, 'r').readlines()\n",
    "\n",
    "for line in manifest:\n",
    "    line = line.strip()\n",
    "    dic = json.loads(line)\n",
    "    filename = dic['audio_filepath'].split('/')[-1]\n",
    "    whole_labels.append(filename)\n",
    "\n",
    "for idx in range(len(inf_label)):\n",
    "    whole_embs.extend(inf_emb[idx].numpy())\n",
    "\n",
    "embedding_dir = './myExps/embeddings/'\n",
    "if not os.path.exists(embedding_dir):\n",
    "    os.mkdir(embedding_dir)\n",
    "\n",
    "filename = os.path.basename(test_dataset).split('.')[0]\n",
    "name = embedding_dir + filename\n",
    "\n",
    "np.save(name + '.npy', np.asarray(whole_embs))\n",
    "np.save(name + '_labels.npy', np.asarray(whole_labels))\n",
    "logging.info(\"Saved embedding files to {}\".format(embedding_dir))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "colab_type": "code",
    "id": "SKKVIb7e6vel",
    "outputId": "a3fa3703-da6c-4a07-c20c-c83df11a8f25"
   },
   "outputs": [],
   "source": [
    "!ls $embedding_dir"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "A_7S4Yja7A8V"
   },
   "source": [
    "# Cosine Similarity Scoring\n",
    "\n",
    "Here we provide a script scoring on hi-mia. Its trial file has the structure: `<speaker_name1> <speaker_name2> <target/nontarget>`. First copy the `trails_1m` file present in test folder to our embeddings directory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cp $data_dir/test/trials_1m $embedding_dir/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the below command to output the EER% based on the cosine similarity score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python $NEMO_ROOT/examples/speaker_recognition/hi-mia_eval.py --data_root $embedding_dir --emb $embedding_dir/test_all.npy --emb_labels $embedding_dir/test_all_labels.npy --emb_size 1024\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PLDA Backend\n",
    "To finetune our speaker embeddings further, we used kaldi PLDA scripts to train PLDA and evaluate. From this point going forward, please make sure you installed kaldi and added KALDI_ROOT to your path.\n",
    "\n",
    "To train PLDA, we can either use the dev set or training set. Let's use the training set embeddings to train PLDA and further use this trained PLDA model to score the test embeddings. In order to do that, we should get embeddings for our training data as well. Similar to the above steps, generate the train embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_dataset = os.path.join(data_dir, 'train/train.json')\n",
    "\n",
    "data_layer_test = nemo_asr.AudioToSpeechLabelDataLayer(\n",
    "    manifest_filepath=test_dataset,\n",
    "    labels=None,\n",
    "    batch_size=batch_size,\n",
    "    **eval_dl_params,\n",
    ")\n",
    "\n",
    "audio_signal_test, audio_len_test, label_test, _ = data_layer_test()\n",
    "processed_signal_test, processed_len_test = data_preprocessor(input_signal=audio_signal_test, length=audio_len_test)\n",
    "encoded_test, _ = encoder(audio_signal=processed_signal_test, length=processed_len_test)\n",
    "_, embeddings = decoder(encoder_output=encoded_test)\n",
    "\n",
    "eval_tensors = neural_factory.infer(tensors=[embeddings, label_test], checkpoint_dir=\"./myExps/checkpoints/\" + exp_name)\n",
    "\n",
    "inf_emb, inf_label = eval_tensors\n",
    "whole_embs = []\n",
    "whole_labels = []\n",
    "manifest = open(test_dataset, 'r').readlines()\n",
    "\n",
    "for line in manifest:\n",
    "    line = line.strip()\n",
    "    dic = json.loads(line)\n",
    "    filename = dic['audio_filepath'].split('/')[-1]\n",
    "    whole_labels.append(filename)\n",
    "\n",
    "for idx in range(len(inf_label)):\n",
    "    whole_embs.extend(inf_emb[idx].numpy())\n",
    "\n",
    "if not os.path.exists(embedding_dir):\n",
    "    os.mkdir(embedding_dir)\n",
    "\n",
    "filename = os.path.basename(test_dataset).split('.')[0]\n",
    "name = embedding_dir + filename\n",
    "\n",
    "np.save(name + '.npy', np.asarray(whole_embs))\n",
    "np.save(name + '_labels.npy', np.asarray(whole_labels))\n",
    "logging.info(\"Saved embedding files to {}\".format(embedding_dir))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As part of kaldi, we need `utt2spk` \\& `spk2utt` files to get the ark file for PLDA training. To do that, copy the generated utt2spk file from `data_dir` train folder to create the spk2utt file using \n",
    "\n",
    "`utt2spk_to_spk2utt.pl  $data_dir/train/utt2spk > $embedding_dir/spk2utt`\n",
    "\n",
    "Then run the below python script to get EER score using the PLDA backend scoring. This script does both data preparation for kaldi and PLDA scoring. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python {NEMO_ROOT}/examples/speaker_recognition/kaldi_plda.py --root $embedding_dir  --train_embs $embedding_dir/train.npy --train_labels $embedding_dir/train_labels.npy --eval_embs $embedding_dir/all_embs_himia.npy --eval_labels $embedding_dir/all_ids_himia.npy --stage=1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here `--stage = 1` trains PLDA model but if you already have a trained PLDA then you can directly evaluate on it by using the `--stage=2` option.\n",
    "\n",
    "This should output an EER of 6.32% with minDCF: 0.455"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Performance Improvement\n",
    "\n",
    "To improve your embeddings performance:\n",
    "    \n",
    "* Add more data and Train longer (100 epochs)\n",
    "\n",
    "* Try adding augmentation –see config file\n",
    "\n",
    "* Use a larger model\n",
    "\n",
    "* Train on several GPUs and use mixed precision (on NVIDIA Volta and Turing GPUs)\n",
    "\n",
    "* Start with pre-trained checkpoints"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [],
   "name": "Speaker_Recognition_dataset.ipynb",
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
